[SPARK-13763] [SQL] Remove Project when its Child's Output is Nil by gatorsmile · Pull Request #11599 · apache/spark

gatorsmile · 2016-03-09T04:24:36Z

What changes were proposed in this pull request?

As shown in another PR, #11596, we use SELECT 1 as a dummy table in SQL statements that require a table reference but do not care about the table's contents. For example:

SELECT value FROM (select 1) dummyTable Lateral View explode(array(1,2,3)) adTable as value

Before this PR, the optimized plan still contains a useless Project after the Optimizer executes the ColumnPruning rule, as shown below:

== Analyzed Logical Plan ==
value: int
Project [value#22]
+- Generate explode(array(1, 2, 3)), true, false, Some(adtable), [value#22]
   +- SubqueryAlias dummyTable
      +- Project [1 AS 1#21]
         +- OneRowRelation$

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- Project
   +- OneRowRelation$

After the fix, the optimized plan no longer contains the useless Project, as shown below:

== Optimized Logical Plan ==
Generate explode([1,2,3]), false, false, Some(adtable), [value#22]
+- OneRowRelation$

This PR removes a Project with an empty projectList when its child's output is also Nil.

How was this patch tested?

Added a new unit test case to the suite ColumnPruningSuite.scala.

gatorsmile · 2016-03-09T04:25:32Z

cc @marmbrus @cloud-fan @dilipbiswal

cloud-fan · 2016-03-09T04:49:20Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala

+  test("Eliminate the Project with an empty projectList") {
+    val input = OneRowRelation
+    val query =
+      Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze


Where do you test empty projectList?

When running Optimize.execute(query), the second Project's projectList is first pruned to empty. Then the second Project is removed.

Let me add another case with an empty list too.
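The two-step behavior described in the reply above (the inner projectList is pruned to empty, then the inner Project is dropped) can be sketched with a toy model. Everything here is an assumption made for illustration: Alias, Project and OneRowRelation are plain stand-in case classes, and pruneInner and eliminate are hypothetical helpers, not Catalyst's real classes or the actual ColumnPruning rule.

```scala
// Toy model only: these are NOT Spark's Catalyst classes.
object PruneThenEliminateSketch {
  // A named expression: the column it produces and the child columns it reads.
  case class Alias(name: String, references: Set[String])

  sealed trait Plan { def output: Seq[String] }
  case object OneRowRelation extends Plan { val output: Seq[String] = Nil }
  case class Project(projectList: Seq[Alias], child: Plan) extends Plan {
    def output: Seq[String] = projectList.map(_.name)
  }

  // Step 1: keep only the inner expressions that the outer Project reads.
  def pruneInner(plan: Plan): Plan = plan match {
    case Project(outer, Project(inner, grandChild)) =>
      val needed = outer.flatMap(_.references).toSet
      Project(outer, Project(inner.filter(a => needed(a.name)), grandChild))
    case other => other
  }

  // Step 2: drop any Project whose output equals its child's output.
  def eliminate(plan: Plan): Plan = plan match {
    case Project(list, child) =>
      val newChild = eliminate(child)
      val project = Project(list, newChild)
      if (project.output == newChild.output) newChild else project
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the shape of the test query: Project(1 AS "1", Project(1 AS "1", OneRowRelation)).
    val one = Alias("1", Set.empty)
    val query = Project(Seq(one), Project(Seq(one), OneRowRelation))
    println(eliminate(pruneInner(query)))
  }
}
```

In this model the literal 1 reads no child columns, so the outer Project needs nothing from the inner one; the inner list becomes empty and the inner Project then has the same (empty) output as OneRowRelation.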

SparkQA · 2016-03-09T05:54:12Z

Test build #52721 has finished for PR 11599 at commit 0fa21ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-09T06:42:00Z

Test build #52724 has finished for PR 11599 at commit a31b1b5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-09T06:56:01Z

@cloud-fan Added another two cases. Feel free to let me know if you want me to add more cases. Thanks!

cloud-fan · 2016-03-09T07:25:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

      }

+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child


I'm concerned about the correctness of this rule. This is actually not column pruning: it adds columns, since the child may have one or more columns.

And why can't this rule work: case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child?

Because OneRowRelation has no output, so its output differs from that of its parent Project.

But a Project with an empty projectList also has no output, right?

case p @ Project(_, l: LeafNode) => p

There is another case above it. Thus, it will stop here.

How about this?

case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p

Then, we do not need the first line.

Yeah. As I posted before, I added a new rule that fixes this issue too, as a side effect.

Thanks @viirya @cloud-fan !

I am not sure which way is better.

case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p

My concern is that the above line looks hackier than the fix in the current PR.

Let me respond to the original question by @cloud-fan.
We will not see an empty Project if the child has more than one column. The empty Project only appears after ColumnPruning. I am fine with adding an extra rule just for eliminating the Project, if we want that.

How about we just move that case ahead? It seems always safe to apply case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child

I thought we did it this way intentionally. I am not 100% sure whether we will hit any issues because of it. Let me try it and check whether any test cases fail.
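Why the order of the two cases matters can be illustrated with a small pattern-matching sketch. This is a simplified model under stated assumptions: Plan, Leaf, Project and OneRowRelation are stand-in types, and leafCaseFirst and sameOutputFirst are hypothetical names for the two orderings discussed in this thread, not the real body of the optimizer rule.

```scala
// Toy model only: these are NOT Spark's Catalyst classes.
object CaseOrderSketch {
  sealed trait Plan { def output: Seq[String] }
  sealed trait Leaf extends Plan
  case object OneRowRelation extends Leaf { val output: Seq[String] = Nil }
  case class Project(projectList: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = projectList
  }

  // Leaf case first: a Project over a leaf matches here and is returned as-is,
  // so the same-output check below is never reached for it.
  def leafCaseFirst(plan: Plan): Plan = plan match {
    case p @ Project(_, _: Leaf) => p
    case p @ Project(_, child) if child.output == p.output => child
    case other => other
  }

  // Same-output case moved ahead: a no-op Project over a leaf is now removed.
  def sameOutputFirst(plan: Plan): Plan = plan match {
    case p @ Project(_, child) if child.output == p.output => child
    case p @ Project(_, _: Leaf) => p
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val emptyProject = Project(Nil, OneRowRelation)
    println(leafCaseFirst(emptyProject))   // the Project survives
    println(sameOutputFirst(emptyProject)) // only the leaf remains
  }
}
```

Scala tries match cases top to bottom and stops at the first one that applies, which is why moving the same-output case ahead is enough in this model and no OneRowRelation-specific guard is needed.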

SparkQA · 2016-03-09T10:27:45Z

Test build #52740 has finished for PR 11599 at commit 68decd1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-09T13:46:47Z

nit: we need to update the title and description. Technically, we can't remove a Project with an empty projectList in general, only when the child also outputs Nil.
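The distinction in this nit can be made concrete with a short sketch. It assumes a toy plan model (Relation, Project and OneRowRelation are stand-ins) and two hypothetical helpers, dropIfEmptyList and dropIfSameOutput, that mimic the two conditions discussed in the review; none of this is Catalyst's real code.

```scala
// Toy model only: these are NOT Spark's Catalyst classes.
object EmptyProjectSafetySketch {
  sealed trait Plan { def output: Seq[String] }
  case object OneRowRelation extends Plan { val output: Seq[String] = Nil }
  case class Relation(output: Seq[String]) extends Plan
  case class Project(projectList: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = projectList
  }

  // Condition from the first revision: drop every Project with an empty list.
  def dropIfEmptyList(plan: Plan): Plan = plan match {
    case Project(list, child) if list.isEmpty => child
    case other => other
  }

  // Safer condition: drop the Project only when it does not change the output.
  def dropIfSameOutput(plan: Plan): Plan = plan match {
    case p @ Project(_, child) if child.output == p.output => child
    case other => other
  }

  def main(args: Array[String]): Unit = {
    val overTable = Project(Nil, Relation(Seq("a", "b")))
    println(dropIfEmptyList(overTable).output)  // columns a and b reappear
    println(dropIfSameOutput(overTable).output) // output stays empty
    println(dropIfSameOutput(Project(Nil, OneRowRelation)))
  }
}
```

In this model the empty-list condition changes the plan's output whenever the child has columns, while the same-output condition only fires when the child's output is also Nil.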

gatorsmile · 2016-03-09T14:58:55Z

Done. The title and PR description are corrected. Thanks!

marmbrus · 2016-03-09T18:28:24Z

Thanks, merging to master.

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#11599 from gatorsmile/projectOneRowRelation.

remove Project with an empty projectList

0fa21ac

cloud-fan reviewed Mar 9, 2016

added two cases.

a31b1b5

cloud-fan reviewed Mar 9, 2016

viirya mentioned this pull request Mar 9, 2016

[SPARK-13771][SQL][WIP] Eliminate child columns from project if the project with no references to its child #11602

Closed

reorder it.

68decd1

gatorsmile changed the title ~~[SPARK-13763] [SQL] Remove Project when its projectList is Empty~~ [SPARK-13763] [SQL] Remove Project when its Child's Output is Nil Mar 9, 2016

asfgit closed this in 23369c3 Mar 9, 2016
